Jin_hp2

Author

Becky Jin

Published

April 27, 2024

Introduction

This report observes and analyzes the state of trees in NYC using the tree censuses conducted by the New York City Department of Parks & Recreation in 1995, 2005, and 2015. Trees are an essential component of the urban landscape and give a city much of its vibrancy and vitality, so monitoring their condition is a significant task.

About Dataset

Three separate datasets on tree conditions are provided, collected one decade apart in 1995, 2005, and 2015. A fourth dataset on tree species contains detailed information about each species of tree that grows in NYC. The number of trees increases slightly across the three censuses, and most trees appear in more than one of them. Since the 2015 census is the most recent, the analysis concentrates mainly on it.

Setup

options(repos = c(CRAN = "https://cloud.r-project.org"))
#| warning: false
suppressPackageStartupMessages({
  library(tidyverse)
  library(here)
  library(knitr)
  library(kableExtra)
  library(ggplot2)
  library(plotly)
  library(gridExtra)
})

Since every dataset CSV file except the tree species one is over 100MB, it is more efficient to read each CSV once, save it as a separate rds file, and load the rds files in later runs.

here::i_am("Jin_hp2.qmd")
here() starts at /Users/apple/Desktop/ling343-files/ling343-hp2
#| warning: false
#df_tree_species <- read.csv(here("new_york_tree_species.csv"))
#df_trees_1995 <- read.csv(here("new_york_tree_census_1995.csv"))
#df_trees_2005 <- read.csv(here("new_york_tree_census_2005.csv"))
#df_trees_2015 <- read.csv(here("new_york_tree_census_2015.csv"))
#write_rds(df_tree_species, "nyc_tree_species.rds")
#write_rds(df_trees_1995, "nyc_tree_census_1995.rds")
#write_rds(df_trees_2005, "nyc_tree_census_2005.rds")
#write_rds(df_trees_2015, "nyc_tree_census_2015.rds")
df_species <- read_rds("nyc_tree_species.rds")
df_1995 <- read_rds("nyc_tree_census_1995.rds")
df_2005 <- read_rds("nyc_tree_census_2005.rds")
df_2015 <- read_rds("nyc_tree_census_2015.rds")
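
The read-once-then-cache step above can be sketched as a small helper. This is a minimal base R sketch, not the code used for the report: `read_cached` is a hypothetical name, and the demonstration uses temporary toy files rather than the census CSVs.

```r
# Hypothetical helper: parse a CSV once, then reuse a cached .rds copy.
read_cached <- function(csv_path, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))   # fast path: the cache already exists
  }
  df <- read.csv(csv_path)      # slow path: parse the CSV
  saveRDS(df, rds_path)         # write the cache for later runs
  df
}

# Toy demonstration with temporary files (not the census data)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(data.frame(tree_id = 1:3, tree_dbh = c(9, 7, 12)),
          csv_path, row.names = FALSE)

first  <- read_cached(csv_path, rds_path)  # parses the CSV and writes the cache
second <- read_cached(csv_path, rds_path)  # loads the cached rds file
```

The design choice is that the expensive parse happens only when the cache file is missing, so the commented-out `read.csv` lines never need to be toggled by hand.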

Brief Overview of Censuses in Parallel

colnames(df_1995)
 [1] "recordid"           "address"            "house_number"      
 [4] "street"             "zip_original"       "cb_original"       
 [7] "site"               "species"            "diameter"          
[10] "status"             "wires"              "sidewalk_condition"
[13] "support_structure"  "borough"            "x"                 
[16] "y"                  "longitude"          "latitude"          
[19] "cb_new"             "zip_new"            "censustract_2010"  
[22] "censusblock_2010"   "nta_2010"           "segmentid"         
[25] "spc_common"         "spc_latin"          "location"          
colnames(df_2005)
 [1] "objectid"   "cen_year"   "tree_dbh"   "tree_loc"   "pit_type"  
 [6] "soil_lvl"   "status"     "spc_latin"  "spc_common" "vert_other"
[11] "vert_pgrd"  "vert_tgrd"  "vert_wall"  "horz_blck"  "horz_grate"
[16] "horz_plant" "horz_other" "sidw_crack" "sidw_raise" "wire_htap" 
[21] "wire_prime" "wire_2nd"   "wire_other" "inf_canopy" "inf_guard" 
[26] "inf_wires"  "inf_paving" "inf_outlet" "inf_shoes"  "inf_lights"
[31] "inf_other"  "trunk_dmg"  "zipcode"    "zip_city"   "cb_num"    
[36] "borocode"   "boroname"   "cncldist"   "st_assem"   "st_senate" 
[41] "nta"        "nta_name"   "boro_ct"    "x_sp"       "y_sp"      
[46] "objectid_1" "location_1"
colnames(df_2015)
 [1] "tree_id"    "block_id"   "created_at" "tree_dbh"   "stump_diam"
 [6] "curb_loc"   "status"     "health"     "spc_latin"  "spc_common"
[11] "steward"    "guards"     "sidewalk"   "user_type"  "problems"  
[16] "root_stone" "root_grate" "root_other" "trunk_wire" "trnk_light"
[21] "trnk_other" "brch_light" "brch_shoe"  "brch_other" "address"   
[26] "zipcode"    "zip_city"   "cb_num"     "borocode"   "boroname"  
[31] "cncldist"   "st_assem"   "st_senate"  "nta"        "nta_name"  
[36] "boro_ct"    "state"      "latitude"   "longitude"  "x_sp"      
[41] "y_sp"      

As the column listings show, each census measured and recorded a different set of variables.
As of 1995, there are 27 columns and 516989 observations.
As of 2005, there are 47 columns and 1777116 observations.
As of 2015, there are 41 columns and 683788 observations.
The number of observations in the 2005 census is suspiciously high. A closer look at the data frame shows that 2 pieces of information for each tree are not formatted correctly and take up 2 extra rows per actual tree observation.

head(df_2005[, c(1,2,3)], 6)
                                  objectid cen_year tree_dbh
1                                  1164781     2005        9
2                                 New York       NA       NA
3 (40.557117429000002 -74.158024325000000)       NA       NA
4                                  1017551     2006        7
5                                 New York       NA       NA
6 (40.771243948399999 -73.911987843600002)       NA       NA

Let’s tidy this into a single row per observation: remove the rows that merely state “New York” (currently the 2nd and 5th rows) and move the coordinates into new columns.

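# Number the rows so each row's position within its 3-row group is known
df_2005 <- df_2005 %>%
  mutate(row_idx = row_number()) %>% 
  select(row_idx, everything())
# Every 3rd row holds the coordinates; copy them up onto the 2 rows above
df_2005 <- df_2005 %>% 
  mutate(coordinates = ifelse(row_idx %% 3 == 0, objectid, NA)) %>%
  fill(coordinates, .direction = "up")
# Keep only the 1st row of each group (drop the "New York" and coordinate rows)
df_2005 <- df_2005 %>%
  filter(!(row_idx %% 3 == 0 | (row_idx + 1) %% 3 == 0)) %>%
  select(-row_idx)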
df_2005 <- df_2005 %>%
  # Split on the space before the minus sign; the lookahead keeps the sign on longitude
  separate(coordinates, into = c("latitude", "longitude"), sep = " (?=-)", extra = "merge") %>%
  mutate(
    latitude = gsub("[()]", "", latitude),
    longitude = gsub("[()]", "", longitude)
  )
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 8840 rows [181, 398, 452,
508, 517, 668, 738, 761, 766, 800, 802, 808, 1011, 1320, 1342, 1398, 1639,
1748, 1765, 1792, ...].
head(df_2005[, c("objectid", "latitude", "longitude")], 6)
  objectid           latitude           longitude
1  1164781 40.557117429000002 -74.158024325000000
2  1017551 40.771243948399999 -73.911987843600002
3   730461 40.685647932400002 -73.981510094699999
4   718618 40.619671549499998 -74.018853453600002
5   926349 40.744177770000000 -73.857010341099993
6   741134 40.676315822399999 -73.982255347000006

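df_2005 now has the tree location coordinates, previously stored in every 3rd row, reorganized into 2 separate columns, latitude and longitude. The warning above reports 8840 rows whose coordinate string could not be split into 2 pieces, so their longitude is NA.
df_2005 now has 592372 observations (1777116 / 3), which reflects the actual number of trees recorded.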
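
The same reshaping can be tried on a toy data frame. The following is a minimal base R sketch under stated assumptions: the values are made up, and the regular expression assumes coordinates of the form `(lat lon)` with an optional minus sign. It parses both numbers with their signs and leaves NA where a string does not match.

```r
# Toy data mimicking the layout: record row, city row, coordinate row
raw <- data.frame(
  objectid = c("1164781", "New York", "(40.557 -74.158)",
               "1017551", "New York", "(40.771 -73.912)"),
  tree_dbh = c(9, NA, NA, 7, NA, NA),
  stringsAsFactors = FALSE
)

stopifnot(nrow(raw) %% 3 == 0)        # the layout must come in complete triplets

pos     <- seq_len(nrow(raw)) %% 3    # 1 = record, 2 = city, 0 = coordinates
records <- raw[pos == 1, , drop = FALSE]
coords  <- raw$objectid[pos == 0]

# Capture two signed decimal numbers; non-matching strings stay NA
pattern <- "^\\((-?[0-9.]+) (-?[0-9.]+)\\)$"
ok  <- grepl(pattern, coords)
lat <- rep(NA_real_, length(coords))
lon <- rep(NA_real_, length(coords))
lat[ok] <- as.numeric(sub(pattern, "\\1", coords[ok]))
lon[ok] <- as.numeric(sub(pattern, "\\2", coords[ok]))

records$latitude  <- lat
records$longitude <- lon
rownames(records) <- NULL
records
```

Storing the coordinates as numbers rather than strings also makes them directly usable for mapping.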

Census of Trees in 2015

Data Dictionary

data_dict <- data.frame(Variable = c("tree_id", "tree_dbh", "stump_diam", "curb_loc", "status", "health",
                                     "spc_latin", "spc_common", "steward", "guards", "sidewalk", "problems",
                                     "address", "zipcode", "borocode", "boroname", "nta", "nta_name",
                                     "latitude", "longitude"),
                        Explanation = c("unique index to distinguish each tree", 
                                        "diameter of a tree at breast height",
                                        "diameter of a tree that has been turned into a stump",
                                        "location of a tree bed relative to the curb {OnCurb | OffsetFromCurb}",
                                        "life status of a tree {Alive | Stump | Dead}",
                                        "health status of a tree {Good | Fair | Poor}", 
                                        "latin name of a tree's species",
                                        "common name of a tree's species",
                                        "whether a tree has stewards {None | 1or2 | 3or4 | 4orMore}",
                                        "whether a tree has guards {Helpful | Harmful | None | Unsure}",
                                        "whether a tree affects nearby sidewalk {Damage | No Damage}",
                                        "the type of problem a tree has or none", 
                                        "address of where a tree is located",
                                        "zipcode of a tree's detailed location",
                                        "code of the borough where a tree is located",
                                        "name of the borough where a tree is located",
                                        "Neighborhood Tabulation Area designated by NYC Department of City Planning",
                                        "name of the Neighborhood Tabulation Area where a tree is located",
                                        "latitude of a tree's location",
                                        "longitude of a tree's location"))
kable(data_dict, caption = "Data Dictionary")
Data Dictionary
Variable     Explanation
tree_id      unique index to distinguish each tree
tree_dbh     diameter of a tree at breast height
stump_diam   diameter of a tree that has been turned into a stump
curb_loc     location of a tree bed relative to the curb {OnCurb | OffsetFromCurb}
status       life status of a tree {Alive | Stump | Dead}
health       health status of a tree {Good | Fair | Poor}
spc_latin    latin name of a tree’s species
spc_common   common name of a tree’s species
steward      whether a tree has stewards {None | 1or2 | 3or4 | 4orMore}
guards       whether a tree has guards {Helpful | Harmful | None | Unsure}
sidewalk     whether a tree affects nearby sidewalk {Damage | No Damage}
problems     the type of problem a tree has or none
address      address of where a tree is located
zipcode      zipcode of a tree’s detailed location
borocode     code of the borough where a tree is located
boroname     name of the borough where a tree is located
nta          Neighborhood Tabulation Area designated by NYC Department of City Planning
nta_name     name of the Neighborhood Tabulation Area where a tree is located
latitude     latitude of a tree’s location
longitude    longitude of a tree’s location
species_count <- df_2015 %>% 
  filter(!is.na(spc_common) & spc_common != "") %>% 
  group_by(spc_common) %>% 
  summarize(count = n()) %>% 
  rename(tree_species = spc_common) %>% 
  arrange(desc(count))

species_count$tree_species <- reorder(species_count$tree_species, -species_count$count)
spc_top_bar_chart <- ggplot(head(species_count, 30), 
                        aes(x = tree_species, y = count, fill = count)) +
    geom_bar(stat = "identity") +
    scale_fill_gradient(low = "lightgreen", high = "darkgreen", "Count") +
    labs(title = "Number of Trees of Top 30 Species in NYC 2015", x = "Tree Species", y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 80, vjust = 0.5, hjust = 1))
spc_top_plotly <- ggplotly(spc_top_bar_chart) 
spc_top_plotly
species_count <- species_count %>% 
  arrange(count)
species_count$tree_species <- reorder(species_count$tree_species, species_count$count)

spc_few_bar_chart <- ggplot(head(species_count, 30), 
                        aes(x = tree_species, y = count, fill = count)) +
    geom_bar(stat = "identity") +
    scale_fill_gradient(low = "yellow", high = "#F97A27", "Count") +
    labs(title = "Number of Trees of Rarest 30 Species in NYC 2015", x = "Tree Species", y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 80, vjust = 0.5, hjust = 1))
spc_few_plotly <- ggplotly(spc_few_bar_chart) 
spc_few_plotly

Tree diameter (tree_dbh) or stump diameter (stump_diam)

tree_or_diam <- df_2015 %>% 
  filter(!(tree_dbh == 0 & stump_diam == 0))
summary(tree_or_diam$tree_dbh[tree_or_diam$tree_dbh != 0])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    5.00   10.00   11.58   16.00  450.00 
summary(tree_or_diam$stump_diam[tree_or_diam$stump_diam != 0])    
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   14.00   16.75   23.00  140.00 
count_non_zero_stump_diam <- sum(tree_or_diam$stump_diam > 0) 
count_non_zero_stump_diam 
[1] 17654
count_zero_tree_dbh <- sum(tree_or_diam$tree_dbh == 0) 
count_zero_tree_dbh
[1] 17654

Excluding the observations that have both tree_dbh and stump_diam entered as 0, the number of stumps with stump_diam > 0 equals the number of observations with tree_dbh = 0. Within this subset every row with tree_dbh = 0 must have stump_diam > 0, so the equal counts mean the two groups are the same rows: each observation is either a standing tree with a trunk diameter or a stump, never both.
Thus, as of 2015, there are 665856 trees and 17654 stumps, not counting the observations with neither diameter logged.
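
The tree-or-stump claim can also be checked row by row instead of by comparing counts. This is a minimal base R sketch on made-up toy rows, assuming the column names of the 2015 census; it is not run on the real data here.

```r
# Toy rows: two standing trees, two stumps, one dead tree with no diameter
toy <- data.frame(
  status     = c("Alive", "Alive", "Stump", "Dead", "Stump"),
  tree_dbh   = c(9, 12, 0, 0, 0),
  stump_diam = c(0, 0, 14, 0, 7)
)

has_dbh   <- toy$tree_dbh > 0
has_stump <- toy$stump_diam > 0

# Cross-tabulate the two flags; the TRUE/TRUE cell counts rows with both diameters
diam_tab <- table(has_dbh, has_stump)
diam_tab

# TRUE when no row carries both a trunk diameter and a stump diameter
never_both <- !any(has_dbh & has_stump)
```

On the real data, the same cross-tabulation of `df_2015` would show at a glance how many rows fall into each of the four cases.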

Next, we may look into two other relevant facets of the tree observations: life status and health status.

status_tb <- table(df_2015$status)
status_df <- data.frame(Status = names(status_tb), Count = as.vector(status_tb))
status_df <- status_df %>% 
  arrange(desc(Count))
colors <- c("#83E16E", "#ff8a00", "#DFC336")
life_status_pie <- plot_ly(status_df, labels = ~Status, values = ~Count, type = 'pie', textinfo = 'label+percent',
        insidetextorientation = 'radial',
        marker = list(colors = colors)) %>%
        layout(title = list(text = 'Pie Chart of Tree Life Status in 2015', x = 0, xanchor = 'left', 
                            font = list(size = 14)), 
              margin = list(l = 40, r = 40, b = 40, t = 70))
life_status_pie
health_empty <- sum(trimws(df_2015$health) == '')
health_empty
[1] 31616

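There are 31616 observations with no health value logged, which are excluded from the health chart. These are almost entirely the stumps and dead trees (17654 + 13961 = 31615), which have no health rating.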
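
To see where missing health values come from, the empty entries can be cross-tabulated against life status. This is a minimal base R sketch on toy values, assuming the `status` and `health` column names of the 2015 census.

```r
# Toy rows: stumps and dead trees carry no health value; one living tree is unlogged
toy <- data.frame(
  status = c("Alive", "Alive", "Alive", "Stump", "Dead", "Alive"),
  health = c("Good", "Fair", "", "", "", "Poor")
)

# Flag empty health entries, ignoring stray whitespace
health_missing <- trimws(toy$health) == ""

# Rows are life status, columns say whether health is missing
missing_by_status <- table(status = toy$status, health_missing)
missing_by_status
```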

health_df_filtered <- df_2015 %>%
  filter(health != '') %>%
  select(health)
health_tb <- table(health_df_filtered$health)
health_tb_df <- data.frame(Health_status = names(health_tb), Count = as.vector(health_tb))
colors2 <- c("#38AAC3", "#56AB5D", "#E8BE12")
health_pie <- plot_ly(health_tb_df, labels = ~Health_status, values = ~Count, type = 'pie',
                      textinfo = 'label+percent',
                      insidetextorientation = 'radial',
                      marker = list(colors = colors2)) %>% 
  layout(title = 'Pie Chart of Tree Health Status in 2015')
health_pie
status_combined <- df_2015 %>% 
  filter(status != '') %>% 
  select(tree_id, status, health)
status_combined <- status_combined %>% 
  rename(life_status = status, health_status = health)

All 683788 tree observations have a life_status value. The health_status value is present only for living trees and is empty for stumps and dead trees.

status_comb_summary <- status_combined %>%
  group_by(life_status, health_status) %>%
  summarise(Count = n(), .groups = 'drop')
status_comb_summary <- status_comb_summary %>% 
  filter(!(life_status == 'Alive' & health_status == '')) %>% 
  arrange(desc(Count))
status_comb_kb <- kable(status_comb_summary, 
                        format = "html", 
                        caption = "Frequency of Combinations of Life_status and Health_status")
status_comb_kb
Frequency of Combinations of Life_status and Health_status
life_status  health_status   Count
Alive        Good           528850
Alive        Fair            96504
Alive        Poor            26818
Stump                        17654
Dead                         13961

Did the specific problems listed in the census affect either status of the trees?

no_log_tree_cnt <- sum(df_2015$problems == '')
no_problem_tree_cnt <- sum(df_2015$problems == 'None')
problem_tree_cnt <- sum(df_2015$problems != '' & df_2015$problems != 'None')

In general, there are 31664 trees with the problems column not logged, 426280 trees with no problem, and 225844 trees with at least one problem to be probed further.

tree_problem_df <- df_2015[, c(2, 11, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)]
colnames(tree_problem_df)
 [1] "block_id"   "steward"    "guards"     "problems"   "root_stone"
 [6] "root_grate" "root_other" "trunk_wire" "trnk_light" "trnk_other"
[11] "brch_light" "brch_shoe"  "brch_other"

Since the ‘problems’ column indicates whether a tree has some or none of the enumerated problems, trees with no problem or with no value logged can be excluded from the analysis of specific problems.

problem_cnt_df <- data.frame(Problem_type = c("Not Logged", "No Problem", "Problem"), 
                             Tree_count = c(no_log_tree_cnt, no_problem_tree_cnt, problem_tree_cnt))
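problem_cnt_df <- problem_cnt_df %>% 
  arrange(Tree_count) %>% 
  # Fix the factor levels so the bars follow this sorted order, not alphabetical order
  mutate(Problem_type = factor(Problem_type, levels = Problem_type))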
problems_bar_chart <- ggplot(problem_cnt_df, aes(x = Problem_type, y = Tree_count)) + 
  geom_bar(stat = "identity", fill = "#7FB3D5", width = 0.5) +
  labs(title = "Bar Plot of Tree Problems Count as of 2015", 
       x = "Problem_Type", 
       y = "Tree_Count") + 
  theme_minimal()
problems_bar_plotly <- ggplotly(problems_bar_chart)
problems_bar_plotly

If we take a deeper look into the subset of trees with problems, we can see how often each specific problem or combination of problems occurs.

specific_problem_type_cnt <- df_2015 %>%
  # Drop the unlogged ("") and "None" entries by value rather than by row position
  filter(!problems %in% c("", "None")) %>%
  group_by(problems) %>%
  summarize(Count = n())
specific_problem_type_cnt <- specific_problem_type_cnt %>% 
  arrange(desc(Count))
#colnames(specific_problem_type_cnt)
specific_problem_type_cnt <- specific_problem_type_cnt %>% 
  rename(problem_type = problems)
problem_bar_chart <- ggplot(head(specific_problem_type_cnt, 25), 
                        aes(x = problem_type, y = Count, fill = Count)) +
    geom_bar(stat = "identity") +
    scale_fill_gradient(low = "#A9CCE3", high = "#1A5276", "Count") +
    labs(title = "Number of Trees with the Most Frequent 25 Combo of Problems", 
         x = "Problem", 
         y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 80, vjust = 0.5, hjust = 1, size = 8))
problem_plotly <- ggplotly(problem_bar_chart)
problem_plotly

As can be noted from the above chart, stones and branch lights are the two problems that occur most often among the trees surveyed in NYC in 2015.
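
Because one tree can have several problems at once, counting individual problems requires splitting each combination first. This is a minimal base R sketch on toy values; the comma separator and the example labels are assumptions about how combinations are stored, not something confirmed by the output shown in this report.

```r
# Toy values; the comma separator is an assumption about how combinations are stored
problems <- c("Stones", "None", "Stones,BranchLights", "",
              "BranchLights", "Stones,WiresRope")

# Keep only trees with at least one logged problem
has_problem <- !(problems %in% c("", "None"))

# Split each combination into its parts and count every part separately
pieces <- unlist(strsplit(problems[has_problem], ",", fixed = TRUE))
problem_counts <- sort(table(pieces), decreasing = TRUE)
problem_counts
```

Counting the parts rather than the combinations avoids spreading one frequent problem across many rare combinations.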

Among the 9 specific problems enumerated in the dataset, what proportion of trees has each one?

create_new_df_from_column <- function(column) {
    counts <- table(column)
    new_df <- data.frame(bool = names(counts), Count = as.integer(counts))
    names(new_df) <- c("bool", "Count")
    return(new_df)
}
colors <- c("#B4720B", "#8FBF53")
new_problem_pie <- function(df) {
  df$Percentage <- df$Count / sum(df$Count) * 100
  new_pie <- ggplot(df, aes(x = "", y = Count, fill = factor(bool))) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar(theta = "y") +
    theme_void() +
    labs(fill = "Bool") +  
    theme(legend.title = element_text(size = 7), legend.text = element_text(size = 7)) +
    geom_text(aes(label = sprintf("%0.1f%%", Percentage)), position = position_stack(vjust = 0.5), size = 3) +
    scale_fill_manual(values = colors)
    return (new_pie)
}


3 Root Problems

root_stone_df <- create_new_df_from_column(tree_problem_df[5])
root_grate_df <- create_new_df_from_column(tree_problem_df[6])
root_other_df <- create_new_df_from_column(tree_problem_df[7])
root_stone_pie <- new_problem_pie(root_stone_df)
root_grate_pie <- new_problem_pie(root_grate_df)
root_other_pie <- new_problem_pie(root_other_df)
grid.arrange(root_stone_pie, root_grate_pie, root_other_pie, ncol = 3,
             top = "Combined Pie Charts for 3 Types of Root Problems")

The above 3 pie charts are for problems root_stone, root_grate, and root_other from left to right.
For instance, 20.5% of all trees in the 2015 census had a root_stone problem while the remaining 79.5% did not. This reading is consistent with the earlier count of 426280 trees with no problem at all.

3 Trunk Problems

trunk_wire_df <- create_new_df_from_column(tree_problem_df[8])
trunk_light_df <- create_new_df_from_column(tree_problem_df[9])
trunk_other_df <- create_new_df_from_column(tree_problem_df[10])
trunk_wire_pie <- new_problem_pie(trunk_wire_df)
trunk_light_pie <- new_problem_pie(trunk_light_df)
trunk_other_pie <- new_problem_pie(trunk_other_df)
grid.arrange(trunk_wire_pie, trunk_light_pie, trunk_other_pie, ncol = 3,
             top = "Combined Pie Charts for 3 Types of Trunk Problems")

The above 3 pie charts are for problems trunk_wire, trunk_light, and trunk_other from left to right.

3 Branch Problems

branch_light_df <- create_new_df_from_column(tree_problem_df[11])
branch_shoe_df <- create_new_df_from_column(tree_problem_df[12])
branch_other_df <- create_new_df_from_column(tree_problem_df[13])
branch_light_pie <- new_problem_pie(branch_light_df)
branch_shoe_pie <- new_problem_pie(branch_shoe_df)
branch_other_pie <- new_problem_pie(branch_other_df)
grid.arrange(branch_light_pie, branch_shoe_pie, branch_other_pie , ncol = 3,
             top = "Combined Pie Charts for 3 Types of Branch Problems")

The above 3 pie charts are for problems branch_light, branch_shoe, and branch_other from left to right.

Observation: Comparing the 3 sets of pie charts side by side, root_stone from set 1 and branch_light from set 3 are the 2 problems with the highest proportions of “Yes”, which denotes that a tree did have that particular problem. These results are in line with the frequency counts of problem types computed earlier from the overall problems column.
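
The comparison across the pie charts can also be expressed as one ranked list of “Yes” shares. This is a minimal base R sketch on a toy data frame, assuming each problem column holds the strings "Yes" and "No"; only 3 of the 9 columns are included.

```r
# Toy problem indicators for five trees
toy <- data.frame(
  root_stone = c("Yes", "No", "No", "No", "Yes"),
  root_grate = c("No", "No", "No", "No", "No"),
  brch_light = c("No", "Yes", "No", "No", "No")
)

# Percentage of trees with "Yes" in each column, largest first
yes_share <- sapply(toy, function(col) mean(col == "Yes")) * 100
yes_share <- sort(yes_share, decreasing = TRUE)
round(yes_share, 1)
```

A single sorted vector makes the ranking explicit, which is harder to judge from nine separate pies.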

Does having stewards and/or guards better protect the trees?

steward_tb <- table(tree_problem_df$steward)
steward_tb_prop <- prop.table(steward_tb)
steward_tb

           1or2    3or4 4orMore    None 
  31615  143557   19183    1610  487823 
steward_tb_prop

                   1or2        3or4     4orMore        None 
0.046235090 0.209943725 0.028054017 0.002354531 0.713412637 
guards_tb <- table(tree_problem_df$guards)
guards_tb_prop <- prop.table(guards_tb)
guards_tb

        Harmful Helpful    None  Unsure 
  31616   20252   51866  572306    7748 
guards_tb_prop

              Harmful    Helpful       None     Unsure 
0.04623655 0.02961737 0.07585099 0.83696409 0.01133100 
protection_comb_summary <- tree_problem_df %>%
  filter(steward != '', guards != '') %>% 
  group_by(steward, guards) %>%
  summarise(Count = n(), .groups = 'drop') %>% 
  arrange(desc(Count))
kable(protection_comb_summary)
steward  guards    Count
None     None     474562
1or2     None      94101
1or2     Helpful   32205
1or2     Harmful   12626
3or4     Helpful   11451
None     Helpful    7234
1or2     Unsure     4625
None     Harmful    4022
3or4     Harmful    3422
3or4     None       3286
None     Unsure     2004
3or4     Unsure     1024
4orMore  Helpful     976
4orMore  None        357
4orMore  Harmful     182
4orMore  Unsure       95
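
The counts above describe how stewards and guards co-occur, but answering the protection question needs health compared within each group. This is a minimal base R sketch on made-up toy rows, assuming the `steward` and `health` column names; it illustrates the technique and says nothing about the real result.

```r
# Toy rows: steward level and health rating for eight living trees
toy <- data.frame(
  steward = c("None", "None", "None", "None", "1or2", "1or2", "3or4", "3or4"),
  health  = c("Good", "Fair", "Poor", "Good", "Good", "Good", "Good", "Fair")
)

# Row-wise proportions: within each steward level, the shares sum to 1
health_by_steward <- prop.table(table(toy$steward, toy$health), margin = 1)
round(health_by_steward, 2)
```

Normalizing within each row matters here because the “None” group is far larger than the others, so raw counts would mostly reflect group size.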

Conclusion

Throughout this report, I analyzed several aspects of the 2015 census of trees in NYC. The dataset records a large number of variables for each tree, so there is a lot to unpack and map, and this report covers only a limited set of facets. I started with general aspects of the trees, such as diameter (dbh) and life status, and then went deeper into particulars that could plausibly be correlated, such as problems, stewards, and guards; no explicit correlation emerged. Overall, it’s quite interesting to explore the tree dynamics of a metropolis.

Dataset Reference

https://www.kaggle.com/datasets/nycparks/tree-census/data